June 6, 2017

Lesson context

UBC MDS Curriculum

UBC MDS Curriculum

DSCI 561 Regression I

(taken from public description of UBC MDS course DSCI 561)

Course Learning Outcomes

By the end of the course, students are expected to be able to:

  • Fit and interpret a linear regression model.
  • Identify whether a linear regression model is appropriate for a given dataset.
  • Critique a specific regression model applied to a given dataset on the basis of both diagnostic plots and hypothesis tests.
  • Specify and interpret interaction terms and nonlinear terms.

Sample lesson: Simple linear regression models in R

1/4 of the way into the course. Before this lesson, students will have learned the following:

  • model notation in R
  • one-way & two-way ANOVA
    • theory
    • how to do in R with aov()
    • how to do in R with lm() (including reference-treatment parameterization)
    • interaction effects
  • linear regression via ordinary least squares (theory)

Sample lesson: Simple linear regression models in R

learning objectives:

By the end of this lesson students are expected to be able to:

  • fit a simple linear model in R
  • interpret the output of the simple linear model object in R
  • use three functions from the broom package extract simple linear model object output

Lesson Motivation

Sample Lesson

The Data:

US property data from 2015 extracted from the Data USA API using Python
https://datausa.io/

display_name income mean_commute_minutes median_property_value non_eng_speakers_pct owner_occupied_housing_units pop
Texas 53207 24.5090 136000 0.3503480 0.622325 26538614
Pennsylvania 53599 25.2695 166000 0.1064820 0.692052 12779559
South Carolina 45483 23.0552 139900 0.0686766 0.685914 4777576
New Hampshire 66779 25.2973 237300 0.0786821 0.709609 1324201
Kansas 52205 18.2779 132000 0.1130530 0.666891 2892987
Hawaii 69515 25.5813 515300 0.2521790 0.569030 1406299
## [1] 52  7

Simple linear regression:

US State property value as a function of income

\(Y = \textrm{State}\: \textrm{median}\: \textrm{property}\: \textrm{value}\)
\(X_{1} = \textrm{income}\)

\(\textrm{State}\: \textrm{median}\: \textrm{property}\: \textrm{value} = \beta_{0} + \beta_{1}\textrm{income} + \varepsilon\)

Syntax for linear regression in R

\(\textrm{State}\: \textrm{median}\: \textrm{property}\: \textrm{value} = \beta_{0} + \beta_{1}\textrm{income} + \varepsilon\)

prop_model <- lm(median_property_value ~ income, data = prop_data)

Syntax for simple linear regression

create linear model object and view output in base R:

prop_model <- lm(median_property_value ~ income, data = prop_data)
summary(prop_model)
## 
## Call:
## lm(formula = median_property_value ~ income, data = prop_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -65249 -36542  -6990   8003 219312 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.503e+05  4.393e+04  -3.422  0.00125 ** 
## income       6.420e+00  7.999e-01   8.026 1.52e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 58860 on 50 degrees of freedom
## Multiple R-squared:  0.563,  Adjusted R-squared:  0.5542 
## F-statistic: 64.41 on 1 and 50 DF,  p-value: 1.517e-10

Decoding R’s output from summary():

Extracting output from model object in base R:

model output code
parameter/coefficient estimates (e.g., \(\beta_{0}\) & \(\beta_{1}\)) model_object$coefficients
residuals model_object$residuals
predicted/fitted values model_object$fitted.values
p-values for coefficients summary(model_object)$coefficients[,4]
\(\sigma\) estimate summary(model_object)$sigma
\(R^{2}\) summary(model_object)$r.squared

Working with model output in base R,

the good, the bad and the ugly…

The good…

  • all the information you want is viewable
  • this has been used for many many many years and thus should be familiar to most Data Scientists and Statisticians

The bad…

  • inconsistent syntax to extract model output
  • model output is returned in a variety of forms, and is not tidy data

The ugly…

  • bizarre symbols in some column names (e.g., summary(model_object)$coefficient)
  • F-statistic p-value is never stored in memory and thus must be calculated by hand

broom:

A better way for working with model output in R

broom for working with model output in R:

The good…

  • all the information you want is stored in memory, and easy to access
  • consistent syntax
  • no weird column names
  • output is returned as data frames in tidy data format

The bad…

  • it’s new, so not everyone is familiar with it

The ugly…

  • ???

broom for working with model output in R:

broom function model output
tidy(model_object) model parameters (e.g., coefficients and p-values)
augment(model_object) model data (e.g., residuals and predicted values)
glance(model_object) model quality, complexity and summaries (e.g., \(\sigma\) estimate and \(R^{2}\))

Example output from tidy()

tidy(prop_model)
##          term      estimate    std.error statistic      p.value
## 1 (Intercept) -1.503050e+05 4.392739e+04 -3.421670 1.248896e-03
## 2      income  6.420101e+00 7.999334e-01  8.025795 1.517389e-10

Example output from augment()

augment(prop_model)
##    median_property_value income   .fitted   .se.fit     .resid       .hat
## 1                 136000  53207 191289.28  8184.217 -55289.283 0.01933481
## 2                 166000  53599 193805.96  8167.204 -27805.963 0.01925451
## 3                 139900  45483 141700.42 10610.207  -1800.420 0.03249626
## 4                 237300  66779 278422.90 13107.757 -41122.898 0.04959552
## 5                 132000  52205 184856.34  8281.684 -52856.341 0.01979808
## 6                 515300  69515 295988.30 14882.799 219311.705 0.06393739
## 7                 162900  47583 155182.63  9624.070   7717.367 0.02673642
## 8                 193500  47169 152524.71  9803.561  40975.289 0.02774300
## 9                 160300  44963 138361.97 10880.682  21938.033 0.03417416
## 10                283400  59269 230207.94  9201.822  53192.063 0.02444181
## 11                153800  57181 216802.77  8559.790 -63002.765 0.02115007
## 12                237300  51243 178680.20  8446.070  58619.796 0.02059184
## 13                129200  53183 191135.20  8185.648 -61935.200 0.01934157
## 14                122400  49576 167977.89  8882.875 -45577.895 0.02277680
## 15                129900  49429 167034.14  8929.926 -37134.140 0.02301873
## 16                148100  49620 168260.38  8869.046 -20160.379 0.02270594
## 17                247800  60629 238939.27  9752.014   8860.725 0.02745202
## 18                159000  47507 154694.71  9656.419   4305.295 0.02691646
## 19                286900  74551 328319.93 18384.621 -41419.925 0.09756522
## 20                217500  55176 203930.46  8220.159  13569.538 0.01950501
## 21                259500  61062 241719.18  9945.789  17780.821 0.02855382
## 22                167500  50255 172337.14  8682.917  -4837.144 0.02176291
## 23                124200  49255 165917.04  8987.290 -41717.042 0.02331542
## 24                123200  43740 130510.18 11550.947  -7310.184 0.03851420
## 25                144100  45047 138901.26 10836.365   5198.744 0.03389635
## 26                173800  49331 166404.97  8962.014   7395.030 0.02318445
## 27                186200  61492 244479.82 10146.265 -58279.822 0.02971653
## 28                138400  48173 158970.49  9382.549 -20570.493 0.02541133
## 29                133200  52997 189941.06  8198.252 -56741.062 0.01940118
## 30                238000  56852 214690.55  8484.221  23309.448 0.02077828
## 31                142100  45219 140005.51 10746.364   2094.487 0.03333564
## 32                245000  65015 267097.84 12035.754 -22097.839 0.04181502
## 33                165800  53357 192252.30  8176.291 -26452.298 0.01929738
## 34                194800  58840 227453.71  9048.489 -32653.714 0.02363403
## 35                120500  19350 -26076.09 28861.891 146576.088 0.24045578
## 36                117900  46879 150662.88  9933.937 -32762.882 0.02848581
## 37                250000  72515 315248.60 16940.706 -65248.599 0.08284164
## 38                231500  60509 238168.86  9699.815  -6668.863 0.02715893
## 39                173700  51847 182557.94  8334.941  -8857.945 0.02005353
## 40                111400  41371 115300.96 12961.218  -3900.964 0.04849281
## 41                173800  57574 219325.87  8659.682 -45525.865 0.02164660
## 42                385500  61818 246572.78 10303.309 138927.225 0.03064355
## 43                475800  70848 304546.29 15785.281 171253.710 0.07192673
## 44                270500  70331 301227.10 15432.775 -30727.098 0.06875016
## 45                154900  46868 150592.26  9938.956   4307.740 0.02851459
## 46                333100  68563 289876.36 14252.125  43223.641 0.05863338
## 47                103800  41751 117740.60 12726.528 -13940.602 0.04675258
## 48                215900  60727 239568.44  9795.134 -23668.445 0.02769532
## 49                125500  43623 129759.03 11617.360  -4259.032 0.03895835
## 50                140500  50957 176844.05  8507.762 -36344.055 0.02089375
## 51                103100  39665 104348.27 14047.630  -1248.271 0.05696286
## 52                315900  72093 312539.32 16645.694   3360.684 0.07998149
##      .sigma      .cooksd  .std.resid
## 1  58918.37 8.870254e-03 -0.94857880
## 2  59320.33 2.233837e-03 -0.47703760
## 3  59455.21 1.624170e-05 -0.03109857
## 4  59149.62 1.340135e-02 -0.71667509
## 5  58964.59 8.308867e-03 -0.90705194
## 6  49863.41 5.065524e-01  3.85125431
## 7  59445.28 2.426254e-04  0.13290666
## 8  59158.67 7.111988e-03  0.70603190
## 9  59370.20 2.544786e-03  0.37926355
## 10 58955.92 1.048761e-02  0.91498312
## 11 58755.71 1.264604e-02 -1.08191811
## 12 58850.56 1.064662e-02  1.00636443
## 13 58780.62 1.113492e-02 -1.06260413
## 14 59089.83 7.151043e-03 -0.78333984
## 15 59213.05 4.799670e-03 -0.63829742
## 16 59384.37 1.394576e-03 -0.34648059
## 17 59441.93 3.288868e-04  0.15265344
## 18 59452.52 7.604647e-05  0.07415162
## 19 59128.61 2.966454e-02 -0.74078851
## 20 59423.55 5.391892e-04  0.23282799
## 21 59399.90 1.380658e-03  0.30650339
## 22 59451.68 7.680021e-05 -0.08309211
## 23 59149.19 6.139276e-03 -0.71718167
## 24 59446.25 3.213271e-04 -0.12666297
## 25 59450.98 1.416635e-04  0.08986269
## 26 59446.18 1.917818e-04  0.12712369
## 27 58851.94 1.547366e-02 -1.00522143
## 28 59381.22 1.633916e-03 -0.35401923
## 29 58889.60 9.375535e-03 -0.97351940
## 30 59360.48 1.699289e-03  0.40020665
## 31 59455.01 2.258750e-05  0.03619367
## 32 59368.26 3.209881e-03 -0.38354645
## 33 59333.21 2.026311e-03 -0.45382412
## 34 59268.06 3.815341e-03 -0.56146036
## 35 54384.95 1.292442e+00  2.85745521
## 36 59265.86 4.675737e-03 -0.56474235
## 37 58653.71 6.051431e-02 -1.15755561
## 38 59447.94 1.841993e-04 -0.11487451
## 39 59442.04 2.364872e-04 -0.15202837
## 40 59453.04 1.176394e-04 -0.06794519
## 41 59091.09 6.765037e-03 -0.78199354
## 42 55934.29 9.084530e-02  2.39738865
## 43 53759.43 3.534772e-01  3.02024164
## 44 59281.53 1.080289e-02 -0.54098008
## 45 59452.51 8.091875e-05  0.07425473
## 46 59114.19 1.784127e-02  0.75689352
## 47 59420.79 1.443155e-03 -0.24258925
## 48 59356.82 2.368628e-03 -0.40781329
## 49 59452.55 1.104317e-04 -0.07381296
## 50 59223.80 4.155077e-03 -0.62403838
## 51 59455.50 1.440480e-05 -0.02183922
## 52 59453.68 1.540307e-04  0.05952812

Example output from glance()

glance(prop_model)
##   r.squared adj.r.squared    sigma statistic      p.value df    logLik
## 1 0.5629882      0.554248 58858.23  64.41339 1.517389e-10  2 -643.8752
##       AIC      BIC     deviance df.residual
## 1 1293.75 1299.604 173214534971          50

Group challenge question:

What did we learn today?

Where do we go from here?

  • finish simple linear regression
    • predicted values and confidence intervals
  • more complex linear regression models (multiple regression)
    • theory (least squares)
    • how to in R (revisit reference-treatment parameterization)
    • interaction effects in linear regression
  • model diagnostics

  • dealing with nonlinear terms

Questions/Discussion